Search Results: "huber"

24 January 2014

Erich Schubert: Definition of Data Science

Everything is "big data" now, and everything is "data science". Because these terms lack a proper, falsifiable definition.

A number of attempts to define them exist, but they usually only consist of a number of "nice to haves" strung together. For Big Data, it's the 3+ V's, and for Data Science, this diagram on Wikipedia is a typical example.

This is not surprising: effectively these terms are all marketing, not scientific attempts at defining a research domain.

Actually, my favorite definition is this, except that it should maybe read pink pony in the middle, instead of unicorn.

Data science has been called "the sexiest job" so often, this has recently led to an integer overflow.

The problem with these definitions is that they are open-ended. They name some examples (like "volume") but they essentially leave it open to call anything "big data" or "data science" that you would like to. This is, of course, a marketer's dream buzzword. There is nothing saying that "picking my nose" is not big data science.

If we ever want to get to a usable definition and get rid of all the hype, we should consider a more precise definition; even when this means making it more exclusive (funnily enough, some people already called above open-ended definitions "elitist" ...).

Big data:

Must involve distributed computation on multiple servers
Must intermix computation and data management
Must advance over the state-of-the-art of relational databases, data warehousing and cloud computing in 2005
Must enable results that were unavailable with earlier approaches, or that would take substantially longer (runtime or latency)
Must be disruptively more data-driven

Data science:

Must incorporate domain knowledge (e.g. business, geology, etc.).
Must take computational aspects into account (scalability etc.).
Must involve scientific techniques such as hypothesis testing and result validation.
Results must be falsifiable.
Should involve more mathematics and statistics than earlier approaches.
Should involve more data management than earlier approaches (indexing, sketching&hashing etc.).
Should involve machine learning, AI or knowledge discovery algorithms.
Should involve visualization and rapid prototyping for software development.
Must satisfy at least one of these shoulds in a disruptive level.

But this is all far from a proper definition. Partially because these fields are so much in flux; but largely because they're just too ill-defined.

There is a lot of overlap that we should try to flesh out. For example, data science is not just statistics, because it is much more concerned with how data is organized and how the computations can be made efficient. Yet often, statistics is much better at integrating domain knowledge. People coming from computation, on the other hand, usually care too little about the domain knowledge and falsifiability of their results - they're happy if they can compute anything.

Last but not least, nobody will be in favor of such a rigid definition and requirements. Because most likely, you will have to strip that "data scientist" label off your business card - and why bite the hand that feeds? Most of what I do certainly would not qualify as data science or big data anymore with an "elitist" definition. While this doesn't lessen my scientific results, it makes them less marketable.

Essentially, this is like a global "gentleman's agreement". Buzz these words while they last, then move on to the next similar "trending topic".

Maybe we should just leave these terms to the marketing folks, and let them bubble them till it bursts. Instead, we should just stick to the established and better defined terms...

When you are doing statistics, call it statistics.
When you are doing unsupervised learning, call it machine learning.
When your focus is distributed computation, call it distributed computing.
When you do data management, continue to call it data management and databases.
When you do data indexing, call it data indexing.
When you are doing unsupervised data mining, call it cluster analysis, outlier detection, ...
Whatever it is, try to use a precise term, instead of a buzzword.

Thank you.

Of course, sometimes you will have to play Buzzword Bingo. Nobody is going to stop you. But I will assume that you are just playing buzzword bingo, unless you get more precise.

Once you then have results that are so massively better, and really disrupted science, then you can still call it "data science" later on.

As you may have noticed, I've been picking on the word "disruptive" a lot. As long as you are doing "business as usual", and focusing on off-the-shelf solutions, it will not be disruptive. And it then won't be big data science, or a big data approach that yields major gains. It will be just "business as usual" with different labels, and return results as usual.

Let's face it. We don't just want big data or data science. What everybody is looking for is disruptive results, which will require a radical approach, not a slight modification involving slightly more computers of what you have been doing all along.

22 January 2014

Erich Schubert: The init wars

The init wars have recently caught a lot of media attention (e.g. heise, prolinux, phoronix). However, one detail is often overlooked: Debian is debating over the default, while all of them are actually already supported to a large extent. Most likely, at least two of them will be made mandatory to support, IMHO.

The discussion seems to be quite heated, with lots of people trying to evangelize for their preferred system. This actually only highlights that we need to support more than one, as Debian has always been about choice. This may mean some extra work for the debian-installer developers, because choosing the init system at install time (instead of switching later) will be much easier. More often than not, when switching from one init system to another you will have to perform a hard reset.

If you want to learn about the options, please go to the formal discussion page, which does a good job at presenting the positions in a neutral way.

Here is my subjective view of the init systems:

SysV init is the current default, and thus deserves to be mentioned first. It is slow, because it is based on a huge series of shell scripts. It can often be fragile, but at the same time it is very transparent. For a UNIX system administrator, SysV init is probably the preferred choice. You only reboot your servers every year anyway.
upstart seems to be a typical Canonical project. It solves a great deal of problems, but apparently isn't good enough at it for everybody, and they fail at including anyone in their efforts. Other examples of these fails include Unity and Mir, where they also announced the project as-is, instead of trying to get other supporters on board early (AFAICT). The key problem to widespread upstart acceptance seems to be the Canonical Contributor License Agreement that many would-be contributors are unwilling to accept. The only alternative would be to fork upstart completely, to make it independent of Canonical. (Note that upstart nevertheless is GPL, which is why it can be used by Debian just fine. The CLA only makes getting patches and enhancements included in the official version hard.)
systemd is the rising star in the init world. It probably has the best set of features, and it has started to incorporate/replace a number of existing projects such as ConsoleKit. I.e. it not only manages services, but also user sessions. It can be loosely tied to the GNOME project which has started to rely on it more and more (much to the unhappiness of Canonical, who used to be a key player for GNOME; note that officially, GNOME chose to not depend on systemd, yet I see this as the only reliable combination to get a complete GNOME system running, and since "systemd can eventually replace gnome-session" I foresee this tie to become closer). As the main drawback, systemd as is will (apparently) only work with the Linux kernel, whereas Debian has to also support kFreeBSD, NetBSD, Hurd and the OpenSolaris kernels (some aren't officially supported by Debian, but by separate projects).

So my take: I believe the only reasonable default is systemd. It has the most active development community and widest set of features. But as it cannot support all architectures, we need mandatory support for an alternative init system, probably SysV. Getting both working reliably will be a pain, in particular since more and more projects (e.g. GNOME) tie themselves closely to systemd, and would then become Linux-only or require major patches.

I have tried only systemd on a number of machines, and unfortunately I cannot report it as "prime time ready" yet. You do have the occasional upgrade problems and incompatibilities, as it is quite invasive. From screensavers activating during movies to double suspends, to being unable to shutdown my system when logged in (systemd would treat the login manager as separate session, and not being the sole user it would not allow me to shut down), I have seen quite a lot of annoyances happen. This is an obvious consequence of the active development on systemd. This means that we should make the decision early, because we will need a lot of time to resolve all these bugs for the release.

There are more disruptions coming on the way. Nobody seems to have talked about kDBUS yet, the integration of an IPC mechanism like DBUS into the Linux kernel. It IMHO has a good chance of making it into the Linux kernel rather soon, and I wouldn't be surprised if it became mandatory for systemd soon after. Which then implies that only a recent kernel (say, mid-2014) version might be fully supported by systemd soon.

I would also like to see less GNOME influence in systemd. I have pretty much given up on the GNOME community, which is moving into a UI direction that I hate: they seem to only care about tablet and mobile phones for dumb users, and slowly turn GNOME into an android UI; selling black background as major UI improvements. I feel that the key GNOME development community does not care about developers and desktop users like me anymore (but dream of being the next Android), and therefore I have abandoned GNOME and switched to XFCE.

I don't give upstart much of a chance. Of course there are some Debian developers already involved in its development (employed by Canonical), so this will cause some frustration. But so far, upstart is largely an Ubuntu-only solution. And just like Mir, I don't see much future in it; instead I foresee Ubuntu going systemd within a few years, because it will want to get all the latest GNOME features. Ubuntu relies on GNOME, and apparently GNOME already has chosen systemd over upstart (even though this is "officially" denied).

Sticking with SysV is obviously the easiest choice, but it does not make a lot of sense to me technically. It's okay for servers, but more and more desktop applications will start to rely on systemd. For legacy reasons, I would however like to retain good SysV support for at least 5-10 more years.

But what is the real problem? After all, this is a long overdue decision.

There is too much advocacy and evangelism, from either side. The CTTE isn't really left alone to make a technical decision; instead, the main factors have unfortunately become political. You have all kinds of companies (such as Spotify) weigh in on the debate, too.
The tone has become quite aggressive and emotional, unfortunately. I can already foresee some comments on this blog post "you are a liar, because GNOME is now spelled Gnome!!1!".
Media attention. This upcoming decision has been picked up by various Linux media already, increasing the pressure on everybody.
Last but not least, the impact will be major. Debian is one of the largest distributions, and is used as a base by Ubuntu and Steam, amongst others. Debian preferring one over the other will be a slap in somebody's face, unfortunately.

So how to solve it? Let the CTTE do their discussions, and stop flooding them with mails trying to influence them. There has been so much influencing going on, it may even backfire. I'm confident they will find a reasonable decision, or they'll decide to poll all the DDs. If you want to influence the outcome, provide patches to anything that doesn't yet fully support your init system of choice! I'm sure there are hundreds of packages which have neither upstart nor systemd support yet (as is, I currently have 27 init.d scripts launched by systemd, for example). IMHO, nothing is more convincing than having things just work, and of course, contributing code. We are in open source development, and the one thing that gets you sympathy in the community is to contribute code to someone else's project. For example, contribute full integrated power-management support into XFCE, if you include power management functionality.

As is, I have apparently 7 packages installed with upstart support, and 25 with systemd support. So either, everybody is crazy about systemd, or they have the better record of getting their scripts accepted upstream. (Note that this straw poll is biased - with systemd, the benefits of not using "legacy" init.d script may just be larger).

16 December 2013

Erich Schubert: Java Hotspot Compiler - a heavily underappreciated technology

When I had my first contacts with Java, probably around Java 1.1 or Java 1.2, it felt all clumsy and slow. And this is still the reputation that Java has to many developers: bloated source code and slow performance.

In recent years I've worked a lot with Java; it would not have been my personal first choice, but as this is usually the language the students know best, it was the best choice for this project, the data mining framework ELKI.

I've learned a lot about Java since, also about debugging and optimizing Java code. ELKI contains a number of tasks that require good number crunching performance; something where Java particularly had the reputation of being slow.

I must say, this is not entirely fair. Sure, the pure matrix multiplication performance of Java is not up to Fortran (BLAS libraries are usually implemented in Fortran, and many tools such as R or NumPy will use them for the heavy lifting). But there are other tasks than matrix multiplication, too!

There are a number of things where Java could be improved a lot. Some of these will be coming with Java 8, others are still missing. I'd particularly like to see native BLAS support and multi-valued on-stack returns (to allow intrinsic sincos, for example).

In this post, I want to emphasize that usually the Hotspot compiler does an excellent job.

A few years ago, I used to laugh at those who claimed "Java code can even be faster than C code", because the JVM itself is written in C. Having had a deeper look at what the Hotspot compiler does, I now say: I'm not surprised that quite often, reasonably good Java code outperforms reasonably good C code.

In fact, I'd love to see a "hotspot" optimizer for C.

So what is it that makes Hotspot so fast? In my opinion, the key ingredient to Hotspot performance is aggressive inlining. And this is exactly why "reasonably well written" Java code can be faster than C code written at a similar level.

Let me explain this with an example. Assume we want to compute a pairwise distance matrix, but we want the code to be able to support arbitrary distance functions. The code will roughly look like this (not heavily optimized):

for (int i = 0; i < size; i++)  
  for (int j = i + 1; j < size; j++)  
    matrix[i][j] = computeDistance(data[i], data[j]);

In C, if you want to be able to choose computeDistance at runtime, you would likely make it a function pointer, or in C++ use e.g. boost::function or a virtual method. In Java, you would use an interface method instead, i.e. distanceFunction.distance().

In C, your compiler will most likely emit a jmp *%eax instruction to jump to the method to compute the distance; with virtual methods in C++, it would load the target method from the vtable and then jmp there. Technically, it will likely be a "register-indirect absolute jump". Java will, however, try to inline this code at runtime, i.e. it will often insert the actual distance function used at the location of this call.

Why does this make a difference? CPUs have become quite good at speculative execution, prefetching and caching. Yet, it can still pay off to save those jmps as far as I can tell; and if it is just to allow the CPU to apply these techniques to predict another branch better. But there is also a second effect: the hotspot compiler will be optimizing the inlined version of the code, whereas the C compiler has to optimize the two functions independently (as it cannot know they will be only used in this combination).

Hotspot can be quite aggressive there. It will even inline and optimize when it is not 100% sure that these assumptions are correct. It will just add simple tests (e.g. adding some type checks) and jump back to the interpreter/compiler when these assumptions fail and then reoptimize again.

You can see the inlining effect in Java when you use the -Xcomp flag, telling the Java VM to compile everything at load time. It cannot do as much speculative inlining there, as it does not know which method will be called and which class will be seen. Instead, it will have to compile the code using virtual method invocations, just like C++ would use for executing this. Except that in Java, every single method will be virtual (in C++, you have to be explicit). You will likely see a substantial performance drop when using this flag, so its use is not recommended. Instead, let Hotspot perform its inlining magic. It will inline a lot by default - in particular tiny methods such as getters and setters.

I'd love to see something similar in C or C++. There are some optimizations that can only be done at runtime, not at compile time. Maybe not even at linking time; but only with runtime type information, and that may also change over time (e.g. the user first computes a distance matrix for Euclidean distance, then for Manhattan distance).

Don't get me wrong. I'm not saying Java is perfect. There are a lot of common mistakes, such as using java.util.Collections for primitive types, which comes at a massive memory cost and garbage collection overhead. The first thing to debug when optimizing Java applications is to check for memory usage overhead. But all in all, good Java code can indeed perform well, and may even outperform C code, due to the inlining optimization I just discussed; in particular on large projects where you cannot fine-tune inlining in C anymore.

Sometimes, Hotspot may also fail, which is largely why I've been investigating these issues recently. In ELKI 0.6.0 I'm facing a severe performance regression with linear scans (which is actually the simpler codepath, not using indexes but using a simple loop as seen above). I had this with 0.5.0 before, but back then I was able to revert to an earlier version that still performed well (even though the code was much more redundant). This time, I would have had to revert a larger refactoring that I wanted to keep, unfortunately.

Because the regression was quite severe - from 600 seconds to 1500-2500 seconds (still clearly better than -Xcomp) - I first assumed I was facing an actual programming bug. Careful inspection down to the assembler code produced by the hotspot VM did not reveal any such error. Then I tried Java 8, and the regression was gone.
So apparently, it is not a programming error, but Java 7 failed to optimize it anywhere near as well as it did with the previous ELKI version!
If you are a Java guru interested in tracking down this regression, feel free to contact me. It's in an open source project, ELKI. I'd be happy to have good performance even for linear scans on Java 7. But I don't want to waste any more hours on this; instead I plan to move on to Java 8, also for other reasons (lambda expressions, which will greatly reduce the amount of glue code needed). Plus, Java 8 is faster in my benchmarks.

2 November 2013

Erich Schubert: Numerical precision and "Big Data"

Everybody is trying (or pretending) to be "big data" these days. Yet, I have the impression that there are very few true success stories. People fight with the technology to scale, and have not even arrived at the point of doing detailed analysis - or even meta-analysis, i.e. whether the new solutions actually perform better than the old "small data" approaches.

In fact, a lot of the "big data" approaches just reason "you can't do it with the existing solutions, so we cannot compare". Which is not exactly true.

My experience from experiments with large data (not big; it still fits into main memory) is that you have to be quite careful with your methods. Just scaling up existing methods does not always yield the expected results. The larger your data set, the more tiny problems surface that can ruin your computations.

"Big data" is often based on the assumption that just by throwing more data at your problem, your results will automatically become more precise. This is not true. On the contrary: the larger your data, the more likely you have some contamination that can ruin everything.

We tend to assume that numerics and such issues have long been resolved. But while there are some solutions, it should be noted that they come at a price: they are slower, have some limitations, and are harder to implement.

Unfortunately, such numerical issues are just about everywhere in data analysis. I'll demonstrate it with a very basic example. Assume we want to compute the mean of the following series: [1e20, 3, -1e20]. Computing the mean, everybody should be able to do this, right? Well, let's agree that the true solution is 1, as the first and last term cancel out. Now let's try some variants:

Python, naive: sum([1e20, 3, -1e20])/3 yields 0.0
Python, NumPy sum: numpy.sum([1e20, 3, -1e20])/3 yields 0.0
Python, NumPy mean: numpy.mean([1e20, 3, -1e20]) yields 0.0
Python, less-known function: math.fsum([1e20, 3, -1e20])/3 yields 1.0
Java, naive: System.out.println( (1e20+3-1e20)/3 ); yields 0.0
R, mean: mean( c(1e20,3,-1e20) ) yields 0
R, sum: sum( c(1e20,3,-1e20) )/3 yields 0
Octave, mean: mean([1e20,3,-1e20]) yields 0
Octave, sum: sum([1e20,3,-1e20])/3 yields 0

So what is happening here? All of these functions (except Python's lesser-known math.fsum) use plain double precision accumulation. With double precision, 1e20 + 3 = 1e20, as a double can only retain 15-16 decimal digits of precision. To actually get the correct result, you need to keep track of your error using additional doubles.
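A minimal sketch of what "keeping track of the error with an additional double" can look like. It uses the classic TwoSum error-free transformation; this is an illustration only, not necessarily what math.fsum does internally, and the function name two_sum is made up here.

```python
def two_sum(a, b):
    """TwoSum: return (s, err) where s is the rounded a + b and err is the rounding error."""
    s = a + b
    b_virtual = s - a          # the part of b that actually made it into s
    a_virtual = s - b_virtual  # the part of a that actually made it into s
    err = (a - a_virtual) + (b - b_virtual)
    return s, err


# 1e20 + 3 rounds to 1e20, but the lost 3 is recovered in the second double.
s, err = two_sum(1e20, 3.0)
print(s, err)                 # 1e+20 3.0

# Carry the error term along and add it back at the end.
s2, err2 = two_sum(s, -1e20)
print((s2 + err + err2) / 3)  # 1.0
```

The price is visible even in this tiny example: several extra floating point operations per addition, plus the bookkeeping for the error terms.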

Now you may argue, this would only happen when having large differences in magnitude. Unfortunately, this is not true. It also surfaces when you have a large number of observations! Again, I'm using python to exemplify (because math.fsum is accurate).

> a = array(range(-1000000,1000001)) * 0.000001
> min(a), max(a), numpy.sum(a), math.fsum(a)
(-1.0, 1.0, -2.3807511517759394e-11, 0.0)

As can be seen from the math.fsum function, solutions exist. For example Shewchuk's algorithm (which is probably what powers math.fsum). For many cases, Kahan summation will also be sufficient, which essentially gives you twice the precision of doubles.
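For illustration, here is a minimal sketch of Kahan summation in plain Python, compared against a naive loop, with math.fsum as the reference. The function names and the test data (one million copies of 0.1) are my own choice for this example. The naive loop is written out by hand on purpose, since newer Python versions may already compensate inside the built-in sum.

```python
import math


def naive_sum(values):
    """Plain left-to-right accumulation in a single double."""
    total = 0.0
    for x in values:
        total += x
    return total


def kahan_sum(values):
    """Kahan summation: a second double carries the low-order bits lost so far."""
    total = 0.0
    comp = 0.0  # running compensation
    for x in values:
        y = x - comp            # re-inject what was lost in earlier steps
        t = total + y           # low-order bits of y may be lost here ...
        comp = (t - total) - y  # ... and are recovered (negated) here
        total = t
    return total


data = [0.1] * 1_000_000
exact = math.fsum(data)
print("naive error:", abs(naive_sum(data) - exact))
print("Kahan error:", abs(kahan_sum(data) - exact))
```

Note the "for many cases": plain Kahan summation still returns 0.0 for the series [1e20, 3, -1e20] from above, because the compensation itself gets absorbed by the huge final term. For such extreme differences in magnitude you need something like math.fsum.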

Note that these issues become even worse once you use subtraction, such as when computing variance. Never use the famous E[X^2]-E[X]^2 formula. It's mathematically correct, but when your data is not central (i.e. E[X] is not close to 0, and much larger than your standard deviation) then you will see all kinds of odd errors, including negative variance, which may then yield a NaN standard deviation:

> b = a + 1e15
> numpy.var(a), numpy.var(b)
(0.33333366666666647, 0.33594164452917774)
> mean(a**2)-mean(a)**2, mean(b**2)-mean(b)**2
(0.33333366666666647, -21532835718365184.0)

(as you can see, numpy.var does not use the naive single-pass formula; probably they use the classic straightforward two-pass approach)
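The difference between the two formulas can be sketched in a few lines of plain Python. This is a toy example with four hand-picked values whose true (population) variance is 22.5; the function names are made up for illustration, and this is not NumPy's implementation.

```python
def variance_naive(values):
    """Single-pass textbook formula E[X^2] - E[X]^2 (numerically dangerous)."""
    n = len(values)
    mean = sum(values) / n
    mean_sq = sum(x * x for x in values) / n
    return mean_sq - mean * mean


def variance_two_pass(values):
    """Two-pass approach: compute the mean first, then average the squared deviations."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


# Deviations from the mean are -6, -3, 3, 6, so the true variance is 90 / 4 = 22.5.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print(variance_two_pass(data))  # 22.5
print(variance_naive(data))     # not 22.5: the squares are around 1e18 and get rounded
```

In the naive version, each square is around 1e18, where neighbouring doubles are 128 apart, so the small variance is lost entirely in the rounding of the two large terms before they are subtracted.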

So why do we not always use the accurate computations? Well, we use floating point with fixed precision because it is fast. And most of the time, when dealing with well conditioned numbers, it is easily accurate enough. To show the performance difference:

> import timeit
> for f in ["sum(a)", "math.fsum(a)"]:
>     print(timeit.timeit(f, setup="import math; a=range(0,1000)"))
30.6121790409
202.994441986

So unless we need that extra precision (e.g. because we have messy data with outliers of large magnitude) we might prefer the simpler approach which is roughly 3-6x faster (at least as long as pure CPU performance is concerned. Once I/O gets into play, the difference might just disappear altogether). Which is probably why all but the fsum function show the same inaccuracy: performance. In particular, as in 99% of situations the problems won't arise.

Long story. Short takeaway: When dealing with large data, pay extra attention to the quality of your results. In fact, even do so when handling small data that is dirty and contains outliers and different magnitudes. Don't assume that computer math is exact; floating point arithmetic is not.
Don't just blindly scale up approaches that seemed to work on small data. Analyze them carefully. And last but not least, consider if adding more data will actually give you extra precision.

27 September 2013

Erich Schubert: Big Data madness and reality

"Big Data" has been hyped a lot, and because of this it is now burning down and getting a reality check. This has long been overdue, but it is now happening.

I have seen a lot of people be very excited about NoSQL databases; and these databases do have their use cases. However, you just cannot put everything into NoSQL mode and expect it to just work. That is why we have recently seen NewSQL databases, query languages for NoSQL databases etc. - and in fact, they all move towards relational database management again.

Sound new database systems seem to be mostly made up of three aspects: in-memory optimization instead of optimizing for on-disk operation (memory has become a lot cheaper over the last 10 years), the execution of queries on the servers that store the actual data (you may want to call this "stored procedures" if you like SQL, or "map-reduce" if you like buzzwords), and optimized memory layouts (many of the benefits of "column store" databases come from having a single, often primitive, datatype for the full column to scan, instead of alternating datatypes in records).

However, here is one point you need to consider:
is your data actually this "big"? Big as in: Google scale.

I see people use big data and Hadoop a lot when they just shouldn't. I see a lot of people run Hadoop in a VM on their laptop. Ouch.

The big data technologies are not a one-size-fits-all solution. They are the supersize-me solution, and supersize just does not fit every task.

When you look at the cases where Hadoop is really successful, it is mostly in keeping the original raw data, and enabling people to re-scan the full data again when e.g. their requirements changed. This is what Hadoop is really good at: managing 100 TB of data, and allowing you to quickly filter out the few GB that you really need for your current task.

For the actual analysis - or when you don't have 100 TB, and a large cluster anyway - then just don't try to hadoopify everything.

Here is a raw number from a current experiment. I have a student work on indexing image data; he is implementing some state of the art techniques. For these, a large number of image features are extracted, and then clustering is used to find "typical" features to improve image search.

The benchmarking image set is 56 GB (I have others with 2 TB to use next). The subset the student is currently processing is 13 GB. Extracting 2.3 million 128 dimensional feature vectors reduces this to about 1.4 GB. As you can see, the numbers drop quickly.

State of the art seems to be to load the data into Hadoop, and run clustering (actually, this is more of a vector quantization than clustering) into 1000 groups. Mahout is the obvious candidate to run this on Hadoop.

However, as I've put a lot of effort into the data mining software ELKI, I considered also to try processing it in ELKI.

By cutting the data into 10 MB blocks, Mahout/Hadoop can run the clustering in 52x parallel mappers. k-Means is an iterative algorithm, so it needs multiple passes over the data. I have fixed the number of iterations to 10, which should produce a good enough approximation for my use cases.
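To see why one iteration can be split across blocks like this, here is a minimal sketch of a single k-means iteration in map/reduce style. It is plain Python, not Mahout or ELKI code; the names map_block and reduce_partials and the toy data are made up for illustration.

```python
def nearest(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))


def map_block(block, centers):
    """'Mapper': per-cluster coordinate sums and counts for one block of points."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in block:
        c = nearest(point, centers)
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += point[d]
    return sums, counts


def reduce_partials(partials, centers):
    """'Reducer': merge the partial sums of all blocks into new centers."""
    dim = len(centers[0])
    new_centers = []
    for c, old in enumerate(centers):
        n = sum(counts[c] for _, counts in partials)
        if n == 0:
            new_centers.append(list(old))  # an empty cluster keeps its old center
            continue
        new_centers.append([sum(sums[c][d] for sums, _ in partials) / n
                            for d in range(dim)])
    return new_centers


blocks = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
centers = [[1.0, 1.0], [9.0, 9.0]]
partials = [map_block(b, centers) for b in blocks]  # independent, could run in parallel
print(reduce_partials(partials, centers))  # [[0.0, 1.0], [10.0, 11.0]]
```

Each block only needs the current centers and only returns k sums and counts, so the blocks do not have to communicate with each other. But every iteration has to read the full data again, which is where the I/O overhead discussed below comes from.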

K-means is embarrassingly parallel, so one would expect the cluster to really shine at this task. Well, here are some numbers:

Mahout k-Means took 51 minutes on average per iteration (The raw numbers are 38:24, 62:29, 46:21, 56:20, 52:15, 45:11, 57:12, 44:58, 52:01, 50:26, so you can see there is a surprisingly high amount of variance there).
ELKI k-Means on a single CPU core took 4 minutes 25 seconds per iteration, and 45 minutes total, including parsing the input data from an ASCII file. Maybe I will try a parallel implementation next.

So what is happening? Why is ELKI beating Mahout by a factor of 10x?

It's (as always) a mixture of a number of things:

ELKI is quite well optimized. The Java Hotspot VM does a good job at optimizing this code, and I have seen it to be on par with R's k-means, which is written in C. I'm not sure if Mahout has received a similar amount of optimization yet. (In fact, 15% of the Mahout runtime was garbage collection runtime - indicating that it creates too many objects.)
ELKI can use the data in a uniform way, similar to a column store database. It's literally crunching the raw double[] arrays. Mahout on the other hand - as far as I can tell - is getting the data from a sequence file, which then is deserialized into a complex object. In addition to the actual data, it might be expecting sparse and dense vectors mixed etc.
Size: this data set fits well into memory. Once this no longer holds, ELKI will no longer be an option. Then MapReduce/Hadoop/Mahout wins. In particular, such an implementation will by design not keep the whole data set in memory, but need to de-serialize it from disk again on each iteration. This is overhead, but saves memory.
Design: MapReduce is designed for huge clusters, where you must expect nodes to crash during your computation. Well, chances are that my computer will survive 45 minutes, so I do not need this for this data size. However, when you really have large data, and need multiple hours on 1000 nodes to process it, then this becomes important to survive losing a node. The cost is that all interim results are written to the distributed filesystem. This extra I/O comes, again, at a cost.

Let me emphasize this: I'm not saying, Hadoop/Mahout is bad. I'm saying: this data set is not big enough to make Mahout beneficial.

Conclusions: As long as your data fits into your main memory and takes just a few hours to compute, don't go for Hadoop.
It will likely be faster on a single node by avoiding the overhead associated with (current) distributed implementations.

Sometimes, it may also be a solution to use the cluster only for preparing the data, then get it to a powerful workstation, and analyze it there. We did do this with the images: for extracting the image features, we used a distributed implementation (not on Hadoop, but on a dozen PCs).

I'm not saying it will stay that way. I have plans for starting "Project HELKI", aka "ELKI on Hadoop". Because I do sometimes hit the barrier of what I can compute on my 8 core CPU in a "reasonable" amount of time. And of course, Mahout will improve over time, and hopefully lose some of its "Java boxing" weight.

But before trying to run everything on a cluster, always try to run it on a single node first. You can still figure out how to scale up later, once you really know what you need.

And last but not least, consider whether scaling up really makes sense. K-means results don't really get better with a larger data set. They are averages, and adding more data will likely only change the last few digits. Now if you have an implementation that doesn't pay attention to numerical issues, you might end up with more numerical error than you gained from that extra data.
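To make the numerical point concrete, here is a minimal sketch (not taken from ELKI or Mahout; the class and method names are made up for this illustration). It averages one million copies of 0.1f, once with a plain float accumulator and once with Kahan compensated summation. The plain version drifts by roughly one percent, which is far more than what a few extra data points would change in a mean.

```java
import java.util.Arrays;

// Hypothetical illustration; names are not from ELKI or Mahout.
class MeanPrecision {
  /** Naive mean: accumulate everything in a single float. */
  static float naiveMean(float[] data) {
    float sum = 0f;
    for (float v : data) {
      sum += v;
    }
    return sum / data.length;
  }

  /** Mean using Kahan compensated summation. */
  static float kahanMean(float[] data) {
    float sum = 0f;
    float comp = 0f; // running compensation for lost low-order bits
    for (float v : data) {
      float y = v - comp;
      float t = sum + y;
      comp = (t - sum) - y; // what was lost when adding y to sum
      sum = t;
    }
    return sum / data.length;
  }

  public static void main(String[] args) {
    float[] data = new float[1_000_000];
    Arrays.fill(data, 0.1f);
    System.out.println("naive: " + naiveMean(data));
    System.out.println("kahan: " + kahanMean(data));
  }
}
```

Using double accumulators would push the problem further out, but it does not remove it; the compensation trick works for either type.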

In fact, k-means can effectively be accelerated by processing a sample first, and only refining this on the full data set then. And: sampling is the most important "big data" approach. In particular when using statistical methods, consider whether more data can really be expected to improve your quality.

Better algorithms: K-means is a very crude heuristic. It optimizes the sum of squared deviations, which may be not too meaningful for your problem. It is not very noise tolerant, either. And there are thousands of variants. For example bisecting k-means, which no longer is embarrassingly parallel (i.e. it is not as easy to implement on MapReduce), but took just 7 minutes doing 20 iterations for each split. The algorithm can be summarized as starting with k=2, then always splitting the largest cluster in two until you have the desired k. For many use cases, this result will be just as good as the full k-means result.
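The splitting scheme just described can be sketched as follows. This is a toy version and not the ELKI implementation: the class name, the initialization (first point plus the point farthest from it) and the fixed iteration count are assumptions made for the example. It starts with one cluster holding everything, which is equivalent to starting at k=2 after the first split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the ELKI implementation.
class BisectingKMeans {
  /** Squared Euclidean distance between two points. */
  static double sqDist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }

  /** Mean of a non-empty list of points. */
  static double[] mean(List<double[]> pts) {
    double[] m = new double[pts.get(0).length];
    for (double[] p : pts) {
      for (int i = 0; i < m.length; i++) m[i] += p[i];
    }
    for (int i = 0; i < m.length; i++) m[i] /= pts.size();
    return m;
  }

  /** Split one cluster in two with a few iterations of 2-means. */
  static List<List<double[]>> bisect(List<double[]> pts, int iterations) {
    // Assumed initialization: the first point, and the point farthest from it.
    double[] c0 = pts.get(0), c1 = pts.get(0);
    for (double[] p : pts) {
      if (sqDist(p, c0) > sqDist(c1, c0)) c1 = p;
    }
    List<double[]> a = new ArrayList<>(), b = new ArrayList<>();
    int rounds = Math.max(1, iterations);
    for (int it = 0; it < rounds; it++) {
      a = new ArrayList<>();
      b = new ArrayList<>();
      for (double[] p : pts) {
        (sqDist(p, c0) <= sqDist(p, c1) ? a : b).add(p);
      }
      if (a.isEmpty() || b.isEmpty()) break; // degenerate data, e.g. all points equal
      c0 = mean(a);
      c1 = mean(b);
    }
    List<List<double[]>> halves = new ArrayList<>();
    halves.add(a);
    halves.add(b);
    return halves;
  }

  /** Repeatedly split the largest cluster until k clusters exist. */
  static List<List<double[]>> cluster(List<double[]> data, int k, int iterations) {
    List<List<double[]>> clusters = new ArrayList<>();
    clusters.add(new ArrayList<>(data));
    while (clusters.size() < k) {
      List<double[]> largest = clusters.get(0);
      for (List<double[]> c : clusters) {
        if (c.size() > largest.size()) largest = c;
      }
      if (largest.size() < 2) break; // nothing left to split
      clusters.remove(largest);
      clusters.addAll(bisect(largest, iterations));
    }
    return clusters;
  }
}
```

Calling BisectingKMeans.cluster(points, k, 20) mirrors the "20 iterations for each split" setting mentioned above. Each split depends on the result of the previous one, which is why this does not map onto MapReduce as easily as plain k-means.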

Don't get me wrong. There are true big data problems. Web scale search, for example. Or product and friend recommendations at Facebook scale. But chances are that you don't have this amount of data. ~~Google probably doesn't employ k-means at that scale either.~~ (actually, Google runs k-means on 5 million keypoints for vector quantization; which, judging from my experience here, can still be done on a single host; in particular with hierarchical approaches such as bisecting k-means)

Don't choose the wrong tool for your problem!

20 August 2013

Erich Schubert: Google turning evil

I used to be a fan of the Google company. I used to think of this as a future career opportunity, as I've received a number of invitations to apply for a job with them.

But I am no longer a fan.

Recently, the company has in my opinion turned to the "evil" side; they probably became worse than Microsoft ever was and the "do no evil" principle is long gone.

So what has changed? In my perception, Google:

Is much more focused on making money now, than making technology happen.
Instead of making cool technology, it tries more and more to be a "hipster" thing and following the classic old 80-20 rule: get 80% of the users with 20% of the effort. Think of the GMail changes that received so much hate and the various services it shuts down (Code search, Reader, Googlecode downloads) and how much Google is becoming a walled garden.
Google leverages its search market dominance as well as Android to push products that don't perform as well as desired. Plus, for example. There are a lot of things I like about Plus. In particular when you use it as a "blog" type of conversation tool instead of a site for sharing your private data, then it's much better than Facebook because of the search function and communities. (And I definitely wouldn't cry if a Facebook successor emerges). But what I hate is how Google tries hard to leverage their Android and YouTube power to force people to Plus. And now they are spreading even further, into TV (ChromeCast) and ISP (Google Fiber) markets.
Hangups, oops, Hangouts is another such example. I've been using XMPP for a long time. At some point, Google started operating an XMPP server, and it would actually "federate" with other servers, and it worked quite nicely. But at some point they decided they need to have more "hipster" features, like ponies. Or more likely, they decided that they want all these users to use Plus. So now they are moving it away from an open standard towards a walled garden. But since 90% of my use is XMPP outside of Google, I will instead just move away from them. Worse.
Reader. They probably shut this down to support Currents and in the long run move people over to Plus, too. Fortunately here, there now exist a number of good alternatives such as Feedly. But again: Google failed my needs.
Maps. Again an example where they moved from "works good" to "hipster crap". The first thing the new maps always greets me with is a gray screen. The new 3D view looks like an earthquake happened, without offering any actual benefit over the classic (and fast) satellite imagery. Worse. In fact my favorite map service right now is an OpenStreetMap offline map.
Google Glass is pure hipsterdom crap. I have yet to see an actual use case for it. People use it to show off, and that's about it. If you are serious about photos, you use a DSLR camera with a large lens and sensor. But face it: it's just distraction. If you want to be productive, it's actually best to turn off all chat and email and focus. Best productivity tip ever: Turn off email notifications. And turn off chat, always-on and Glass, too.
Privacy. When privacy-aware US providers actually recommend against anyone trusting their private data to a company with physical ties to the United States, then maybe it's time to look out for services in countries that value freedom higher than spying. I cannot trust Google on keeping my data private.

We are living in worrisome times. The U.S.A., once proud defenders of freedom, have become the most systematic spies in the world, even on their own people. The surveillance in the Eastern Bloc is no match for this machinery built in the last decade. Say hello to Nineteen Eighty-Four. Surveillance, in a mixture of multi-state and commercial surveillance, has indeed become omnipresent. Wireless sensors in trash cans track your device MACs. Your email is automatically scanned both by the NSA and Google. Your social friendships are tracked by Facebook and your iPhone.

I'm not being paranoid. These things are real, and it can only be explained with a "brainwashing" with both the constantly raised fear of terror (note that you are much more likely to be killed in a traffic accident or by a random gun crazy in the U.S.A. than by terrorists) and the constant advertisement for useless technology such as Google Glass. Why have so many people stopped fighting for their freedom?

So what now? I will look into options to move stuff away from Google (that is mostly my email -- although I'd really like to see a successor to email finally emerge). Maybe I can find a number of good services located e.g. in Switzerland or Norway (who still seem to hold freedom in high respect - neither the UK nor Germany are an alternative these days). And I hope that some politicians will have the courage to openly discuss whether it may be necessary to break up Google into "Boogles" (Baby-Googles), just as Reagan did with AT&T. But unfortunately, today's politicians are really bad at such decisions, in particular when it might lower their short-run popularity. They are only good at making excuses.

25 July 2013

Evgeni Golov: Say hello to Mister Hubert!

Some days ago I got myself a new shiny Samsung 840 Pro 256GB SSD for my laptop. The old 80GB Intel was just too damn small. Instead of just doing a pvmove from the old to the new, I decided to set up the system from scratch. That is an awesome way to get rid of old and unused stuff or at least move it to some lower class storage (read: backup). One of the things I did not bother to copy from the old disk were my ~/Debian, ~/Grml and ~/Devel folders. I mean, hey, it's all in some kind of VCS, right? I can just clone it anew, if I really want. Neither did I copy much of my dotfiles; these are neatly gitted with the help of RichiH's awesome vcsh and a bit of human brains (no private keys on GitHub, yada yada). After cloning a couple of my personal repos from GitHub to ~/Devel, I realized I was doing a pretty dumb job that a machine could do for me. As I already was using Joey's mr for my vcsh repositories, generating a mr config and letting mr do the actual job was the most natural thing to do. So was using Python Requests and GitHub's JSON API. And here is Mister Hubert, aka mrhub: github.com/evgeni/MisterHubert. Just call it with your GitHub username and you get a nice mr config dumped to stdout. Same applies for organizations.

Authentication for private repos? (-p)
Other clone mechanisms? (-c)
A help function? (-h)
Other features?

As usual, I hope this is useful :)

21 May 2013

Erich Schubert: Google Hangouts drops XMPP support

Update: today I've been receiving XMPP messages in the Google+ variant of Hangouts. Looks as if it currently is back (at least while you are logged in via XMPP - haven't tried without pidgin at the same time yet). Let's just hope that XMPP federation will continue to be supported in the long run.

It's been all over the internet, so you probably heard it already: Google Hangouts no longer receives messages from XMPP users. Before, you could easily chat with "federated" users from other Jabber servers.

While of course the various open-source people are not amused -- for me, most of my contacts disappeared, so I then uninstalled Hangouts to get back Google Talk (apparently this works if Talk was preinstalled in your phone's firmware) -- this bears some larger risks for Google:

Reputation: Google used to have the reputation of being open. XMPP support was open, the current "Hangups" protocol is not. This continuing trend of abandoning open standards and moving to "walled garden" solutions will likely harm the company's reputation in the open source community.
Legal risk of an antitrust action: Before, other competitors could interface with Google using an independent and widely accepted standard. An example is United Internet in Germany, which operates for example the Web.de and GMX platforms, mail.com, and the 1&1 internet provider. Effectively locking out its competitors - without an obvious technical reason, as XMPP was working fine just before, and apparently continues to be used at Google for example in AppEngine - bears a high risk of running into an antitrust action in Europe. If I were 1&1, I would try to get my lawyers started... or if I were Microsoft, who apparently just wanted to add XMPP messaging to Hotmail?
Users: Google+ is not that big yet. Especially in Germany. Since 90% of my contacts were XMPP contacts, where am I likely going to move to: Hangouts or another XMPP server? Or back to Skype? I still use Skype for more voice calls than Google (which I used like twice), because there are some people that prefer Skype. One of these calls probably was not using the Google plugin, but an open source phone. Because with XMPP and Jingle, my regular chat client would interoperate. And in fact, the reason I started using Google Talk in the first place was because it would interoperate with other networks, too, and I assumed they would be good at operating a Jabber server.

In my opinion, Google needs to quickly restore a functioning XMPP bridge. It is okay if they offer add-on functionality only for Hangout users (XMPP was always designed to allow for add-on functionality); it is also okay if they propose an entirely new open protocol to migrate to in the long run, if they can show good reasons such as scalability issues. But the way they approached the Hangup rollout looks like a big #fail to me.

Oh, and there are other issues, too. For example Linus Torvalds complains about the fonts being screwed up (not hinted properly) in the new Google+, others complain about broken presence indicators (but then you might as well just send an email, if you can't tell whether the recipient will be able to receive and answer right away), but using Hangouts will apparently also (for now -- rumor has it that Voice will also be replaced by Hangups entirely) lose you Google Voice support. The only thing that seems to give positive press are the easter eggs...

All in all, I'm not surprised to see over 20% of users giving the lowest rating in the Google Play Store, and less than 45% giving the highest rating - for a Google product, this must be really low.

28 February 2013

Erich Schubert: ELKI data mining in 2013

ELKI, the data mining framework I use for all my research, is coming along nicely, and will see continued progress in 2013. The next release is scheduled for SIGMOD 2013, where we will be presenting the novel 3D parallel coordinates visualization we recently developed. This release will bear the version number 0.6.0.

Version 0.5.5 of ELKI is in Debian unstable since December (Version 0.5.0 will be in the next stable release) and Ubuntu raring. The packaged installation can share the dependencies with other Debian packages, so it is smaller than the download from the ELKI web site.

If you are developing cluster analysis or outlier detection algorithms, I would love to see them contributed to ELKI. If I get clean and well-integrated code by mid-June, your algorithm could be included in the next release, too. Publishing your algorithms in source code in a larger framework such as ELKI will often give you more citations. Because it is easier to compare with your algorithm then and to try it on new problems. And, well, citation counts are a measure that administration loves to judge researchers ...

So what else is happening with ELKI:

The new book "Outlier Analysis" by C. C. Aggarwal mentions ELKI for visual evaluation of outlier results as well as in the "Resources for the Practitioner" section and cites around 10 publications closely related to ELKI.
Some classes for color feature extraction of ELKI have been contributed to jFeatureLib, a Java library for feature detection in image data.
I'd love to participate in the Google Summer of Code, but I need a contact at Google to "vouch" for the project, otherwise it is hard to get in. I've been sending a couple of emails, but so far have not heard back much yet.
As the performance of SVG/Batik is not too good, I'd like to see more OpenGL based visualizations. This could also lead to an Android based version for use on tablets.
As I'm not a UI guy, I would love to have someone make a fancier UI that still exposes all the rich functions we have. The current UI is essentially an automatically generated command line builder - which is nice, as new functionality shows up without the need to modify UI code. It's good for experienced users like me, but hard for beginners to get started.
I'd love to see integration of ELKI with e.g. OpenRefine / Google Refine to make it easier to do appropriate data cleaning and preprocessing.
There is work underway for a distributed version running on Hadoop/YARN.

3 December 2012

Erich Schubert: ResearchGate Spam

Update Dec 2012: ResearchGate still keeps on sending me their spam. Most of the colleagues I had that tried out RG now deleted their account there, apparently, so the invitation mails become fewer. Please do not try to push this link on Wikipedia just because you are also annoyed by their emails. My blog is not a "reliable source" by Wikipedia standards. It solely reflects my personal view of that web site, not journalistic or scientific research. The reason why I call ResearchGate spam is the weasel words they use to trick authors into sending the invitation spam. Here's the text coming with the checkbox you need to uncheck (from the ResearchGate "blog")

Add my co-authors that are already using ResearchGate as contacts and invite those who are not yet members.

See how it is worded so it sounds much more like "link my colleagues that are already on researchgate" instead of "send invitation emails to my colleagues"? It deliberately avoids the mentioning of "email", too. And according to the researchgate news post, this is hidden in "Edit Settings", too (I never bothered to try it -- I do not see any benefit to me in their offers, so why should I?). Original post below:

If you are in science, you probably already received a couple of copies of the ResearchGate spam. They are trying to build a "Facebook for scientists", and so far, their main strategy seems to be aggressive invitation spam. So far, I've received around 5 of their "invitations", which essentially sound like "Claim your papers now!" (without actually getting any benefit). When I asked my colleagues about these invitations, none actually meant to invite me! This is why I consider this behaviour of ResearchGate to be spam. Plus, at least one of these messages was a reminder, not triggered by user interaction. Right now, they claim to have 1.9 million users. They also claim "20% interact at least once a month". However, they have around 4000 Twitter followers and Facebook fans, and the top topics on their network are at like 10000-50000 users. That is probably a much more realistic user count estimate: 4k-40k. And these "20%" that interact might just be those 20% the site grew in this timeframe and that happened to click on the sign up link. For a "social networking" site, these numbers are pointless anyway. That is probably even less than MySpace. Because I do not see any benefit in their offers! Before going on an extremely aggressive marketing campaign like this, they really should consider actually having something to offer... And the science community is a lot about not wasting their time. It is a dangerous game that ResearchGate is playing here. It may appeal to their techies and investors to artificially inflate their user numbers in the millions. But if you pay for the user numbers with your reputation, that is a bad deal! Once you have the reputation as being a spammer (and mind you, every scientist I've talked to so far complained about the spam and "I clicked on it only to make it stop sending me emails") it's hard to be taken seriously again. The scientific community is a lot about reputation, and ResearchGate is screwing up badly on this.
In particular, according to the ResearchGate founder on Quora, the invitations are opt-out on "claiming" a paper. Sorry, this is wrong. Don't make users annoy other users by sending them unwanted invitations to a worthless service! And after all, there are alternatives such as Academia and Mendeley that do offer much more benefit. (I do not use these either, though. In my opinion, they also do not offer enough benefit to bother going to their website. I've mentioned the inaccuracy of Mendeley's data - and the lack of an option to get them corrected - before in an earlier blog post. Don't rely on Mendeley as citation manager! Their citation data is unreviewed.) I'm considering sending ResearchGate (they're Berlin based, but there may also be a US office you could direct this to) a cease and desist letter, denying them to store personal information on me, and to use my name on their websites to promote their "services". They may have visions of a more connected and more collaborative science, but they actually don't have new solutions. You can't solve everything by creating yet another web forum and "web2.0izing" everything. Although many of the web 2.0 bubble boys don't want to hear it: you won't solve world hunger and AIDS by doing another website. And there is a life outside the web.

21 November 2012

Axel Beckert: Suggestions for the GNOME Team

Thanks to Erich Schubert's blog posting on Planet Debian I became aware of the 2012 GNOME User Survey at Phoronix. Like back in 2006 I still use some GNOME applications, so I do consider myself as GNOME user in the widest sense and hence I filled out that survey. Additionally I have to live with GNOME 3 as a system administrator of workstations, and that's some kind of usage, too. ;-) The last question in the survey was "Do you have any comments or suggestions for the GNOME team?" Sure I have. And since I tried to give constructive feedback instead of only ranting, here's my answer to that question as I submitted it in the survey, too, just spiced up with some hyperlinks and highlighting:

Don't try to change the users. Give the users more possibilities to change GNOME if they don't agree with your own preferences and decisions. (The trend to castrate the user was already starting with GNOME 2 and GNOME 3 made that worse IMHO.) If you really think that you need less configurability because some non-power-users are confused or challenged by too many choices, then please give the other users at least the chance to enable more configuration options. A very good example in that regard was Kazehakase (RIP) which offered several user interfaces (novice, intermediate and power user or such). The popular text-mode web browser Lynx does the same, too, btw. GNOME lost me mostly with the change to GNOME 2. The switch from Galeon 1.2 to 1.3/2.0 was horrible and the later switch to Epiphany made things even worse on the browser side. My short trip to GNOME as desktop environment ended with moving back to FVWM (configurable without tons of clicking, especially after moving to some other computer) and for the browser I moved on to Kazehakase back then. Nowadays I'm living very well with Awesome and Ratpoison as window managers, Conkeror as web browser (which are all very configurable) and a few selected GNOME applications like Liferea (luckily still quite configurable although I miss Gecko's about:config since the switch to WebKit), GUCharmap and Gnumeric. For people switching from Windows I nowadays recommend XFCE or maybe LXDE on low-end computers. I likely would recommend GNOME 2, too, if it still existed. With regards to MATE I'm skeptical about its persistence and future, but I'm glad it exists as it solves a lot of problems and brings in just a few new ones. Cinnamon as well as SolusOS are based on the current GNOME libraries and are very likely the more persistent projects, but also very likely have the very same multi-head issues we're all barfing about at work with Ubuntu Precise.
(Heck, am I glad that I use Awesome at work, too, and all four screens work perfectly as they did with FVWM before.)

Thanks to Dirk Deimeke for his (German written) pointer to Marcus Moeller's interview with Ikey Doherty (in German, too) about his Debian-/GNOME-based distribution SolusOS.

Erich Schubert: Phoronix GNOME user survey

While not everybody likes Phoronix (common complaints include tabloid journalism), they are doing a GNOME user survey again this year. If you are concerned about Linux on the desktop, you might want to participate; it is not particularly long.

Unfortunately, "the GNOME Foundation still isn't interested in having a user survey", and may again ignore the results; and already last year you could see a lot of articles along the lines of The Survey That GNOME Would Rather Ignore. One more reason to fill it out.

13 November 2012

Erich Schubert: Migrating from GNOME3 to XFCE

I have been a GNOME fan for years. I actually liked the switch from 1.x to 2.x, and at some point switched to 3.x when it became somewhat usable. At some point, I even started some small Gnome projects, one even was uploaded to the Gnome repositories. But I didn't have much time for my Linux hobby anymore back then.

However, I am now switching to XFCE. And for all I can tell, I am about the last one to make that switch. Everybody I know hates the new Gnome.

My reason is not emotional. It's simple: I have systems that don't work well with OpenGL, and thus don't work well with Gnome shell. Up to now, I can live fine with "Fallback mode" (aka: Gnome classic). It works really well for me, and does exactly what I need. But it has been all over the media: Gnome 3.8 will drop 'fallback' mode.

Now the choice is obvious: instead of switching to shell, I go to XFCE. Which is much closer to the original Gnome experience, and very productivity oriented.

There are tons of rants on GNOME 3 (for one of the most detailed ones, see Gnome rotting in threes, going through various issues). Something must be very wrong about what they are doing to receive this many sh*tstorms all the time. Every project receives some. I've even received a share of the Gnome 2 storms when Galeon (an early Gnome browser) made the move and started dropping some of the hard-to-explain and barely used options that would break with every other Mozilla release. And Mozilla embedding was a major pain these days. Yet, for every feature there would be some user somewhere that loved it, and as Debian maintainer of Galeon, I got to see all the complaints (and at the same time was well aware of the bugs caused by the feature overload).

Yet with Gnome 3, things are IMHO a lot different. In Gnome 2, it was a lot about making things more usable as they are, a bit cleaner and more efficient. With Gnome 3, it seems to be about experimenting with new stuff. Which is why it keeps on breaking APIs all the time. For example, theming GTK 3 is constantly broken; most of the themes available just don't work. Similarly, Gnome Shell extensions - most of them work with exactly one version of Gnome Shell (doesn't this indicate the author has abandoned Gnome shell?).

But the one thing that was really sticking out was when I updated the PC of my dad. Apart from some glitches, he could not even shut down his PC with Gnome-shell. Because you needed to press the Alt button to actually get a shutdown option.

This is indicative of where Gnome is heading: something undefined in between PCs, tablets, media centers and mobile phones. They just decided that users don't need to shut down anymore, so they could as well drop that option.

But the worst thing about the current state of GNOME is: They happily live with it. They don't care that they are losing users by the dozens. Because to them, these are just "complainers". Of course there is some truth in "Complainers gonna complain and haters gonna hate". But what Gnome is receiving is way above average. At some point, they should listen. Comment chains 200 posts long from dozens of people on LWN are not just your average "complaints". It's an indicator that a key user base is unhappy with the software. In 2010 GNOME 2 had 45% market share in the LinuxQuestions poll, XFCE had 15%. In 2011, GNOME 3 had 19%, and XFCE jumped to 28%. And I wouldn't be surprised if GNOME 3 shell (not counting fallback mode) would clock at less than 10% in 2012 - despite being default.

Don't get me wrong: there is a lot on Gnome that I really like. But as they decided to drop my preferred UI, I am of course looking for alternatives. In particular, as I can get lots of the Gnome 3 benefits with XFCE. There is a lot in the Gnome ecosystem that I value, and that IMHO is driving Linux forward. Network-manager, Poppler, Pulseaudio, Clutter just to name a few. Usually, the stuff that is modular is really good. And in fact I have been a happy user of the "fallback" mode, too. Yet, the overall "desktop" Gnome 3 goals are in my opinion targeting the wrong user group. Gnome might need to target Linux developers more again, to keep a healthy development community around. Frequently triggering sh*tstorms by high-profile people such as Linus Torvalds is not going to strengthen the community. There is nothing wrong in the FL/OSS community to encourage people to use XFCE. But these are developers that Gnome might need at some point.

On a backend / technical level (away from the Shell/UI stuff that most of the rants are about), my main concern about the Gnome future is GTK3. GTK2 was a good toolkit for cross-platform development. GTK3 as of now is not, but is largely a Linux/Unix only toolkit - in particular, because there apparently is no up to date Win32 port. With GTK 3.4 it was said that they are now working on Windows - but as of GTK 3.6 they are still nowhere to be found. So if you want to develop cross-platform, as of now, you better stay away from GTK 3. If this doesn't change soon, GTK might sooner or later lose the API battle to more portable libraries.

2 November 2012

Erich Schubert: DBSCAN and OPTICS clustering

DBSCAN [wikipedia] and OPTICS [wikipedia] are two of the most well-known density based clustering algorithms. You can read more on them in the Wikipedia articles linked above.

An interesting property of density based clustering is that these algorithms do not assume clusters to have a particular shape. Furthermore, the algorithms allow "noise" objects that do not belong to any of the clusters. K-means, for example, partitions the data space in Voronoi cells (some people claim it produces spherical clusters - that is incorrect). See Wikipedia for the true shape of K-means clusters and an example that cannot be clustered by K-means. Internal measures for cluster evaluation also usually assume the clusters to be well-separated spheres (and do not allow noise/outlier objects) - not surprisingly, as we tend to experiment with artificial data generated by a number of Gaussian distributions.

The key parameter to DBSCAN and OPTICS is the "minPts" parameter. It roughly controls the minimum size of a cluster. If you set it too low, everything will become clusters (OPTICS with minPts=2 degenerates to a type of single link clustering). If you set it too high, at some point there won't be any clusters anymore, only noise. However, the parameter usually is not hard to choose. If you for example expect clusters to typically have 100 objects, I'd start with a value of 10 or 20. If your clusters are expected to have 10000 objects, then maybe start experimenting with 500.

The more difficult parameter for DBSCAN is the radius. In some cases, it will be very obvious. Say you are clustering users on a map. Then you might know that a good radius is 1 km. Or 10 km. Whatever makes sense for your particular application. In other cases, the parameter will not be obvious, or you might need multiple values. That is when OPTICS comes into play.
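To show how the two parameters interact, here is a minimal DBSCAN sketch. It is not the ELKI implementation: the class name is made up, neighbors are found by a brute-force O(n^2) scan instead of an index, and it uses boxed Integers for brevity, which is exactly what an optimized implementation would avoid. A point counts as dense ("core") when at least minPts points, itself included, lie within the radius eps.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch with brute-force neighbor search; not the ELKI implementation.
class SimpleDBSCAN {
  static final int UNVISITED = -2, NOISE = -1;

  /** Indexes of all points within distance eps of point i (including i itself). */
  static List<Integer> rangeQuery(double[][] data, int i, double eps) {
    List<Integer> result = new ArrayList<>();
    for (int j = 0; j < data.length; j++) {
      double sum = 0;
      for (int d = 0; d < data[i].length; d++) {
        double diff = data[i][d] - data[j][d];
        sum += diff * diff;
      }
      if (sum <= eps * eps) result.add(j);
    }
    return result;
  }

  /** Returns one cluster label per point; NOISE (-1) marks noise objects. */
  static int[] cluster(double[][] data, double eps, int minPts) {
    int[] label = new int[data.length];
    Arrays.fill(label, UNVISITED);
    int next = 0;
    for (int i = 0; i < data.length; i++) {
      if (label[i] != UNVISITED) continue;
      List<Integer> neighbors = rangeQuery(data, i, eps);
      if (neighbors.size() < minPts) {
        label[i] = NOISE; // may still become a border point later
        continue;
      }
      int c = next++;
      label[i] = c;
      ArrayDeque<Integer> queue = new ArrayDeque<>(neighbors);
      while (!queue.isEmpty()) {
        int j = queue.poll();
        if (label[j] == NOISE) label[j] = c; // border point of this cluster
        if (label[j] != UNVISITED) continue;
        label[j] = c;
        List<Integer> nj = rangeQuery(data, j, eps);
        if (nj.size() >= minPts) queue.addAll(nj); // j is a core point, keep expanding
      }
    }
    return label;
  }
}
```

For users on a map you would replace the Euclidean distance in rangeQuery with a geographic one, and eps would then be your 1 km or 10 km.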

OPTICS is based on a very clever idea: instead of fixing minPts and the radius, we only fix minPts, and plot the radius at which an object would be considered dense by DBSCAN. In order to sort the objects on this plot, we process them in a priority heap, so that nearby objects are nearby in the plot. This image on Wikipedia shows an example for such a plot.
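The "radius at which an object would be considered dense" is usually called the core distance. A small sketch of just this one ingredient (not of the full OPTICS ordering, and not ELKI code; the class name is made up, and counting the point itself towards minPts is an assumption, as conventions differ):

```java
import java.util.Arrays;

// Hypothetical sketch of the core distance only; not a full OPTICS implementation.
class CoreDistance {
  /**
   * Smallest radius at which point i has at least minPts points
   * (itself included) in its neighborhood, or infinity if the
   * data set has fewer than minPts points.
   */
  static double coreDistance(double[][] data, int i, int minPts) {
    double[] dist = new double[data.length];
    for (int j = 0; j < data.length; j++) {
      double sum = 0;
      for (int d = 0; d < data[i].length; d++) {
        double diff = data[i][d] - data[j][d];
        sum += diff * diff;
      }
      dist[j] = Math.sqrt(sum);
    }
    Arrays.sort(dist); // dist[0] is the point itself, at distance 0
    return minPts <= dist.length ? dist[minPts - 1] : Double.POSITIVE_INFINITY;
  }
}
```

OPTICS then combines this value with the distance to the previously processed objects to obtain the value that is actually plotted; that part is left out here.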

OPTICS comes at a cost compared to DBSCAN. Largely because of the priority heap, but also as the nearest neighbor queries are more complicated than the radius queries of DBSCAN. So it will be slower, but you no longer need to set the parameter epsilon. However, OPTICS won't produce a strict partitioning. Primarily it produces this plot, and in many situations you will actually want to visually inspect the plot. There are some methods to extract a hierarchical partitioning out of this plot, based on detecting "steep" areas.

The open source ELKI data mining framework (package "elki" in Debian and Ubuntu) has a very fast and flexible implementation of both algorithms. I've benchmarked this against GNU R ("fpc" package) and Weka, and the difference is enormous. ELKI without index support runs in roughly 11 minutes, with index down to 2 minutes for DBSCAN and 3 minutes for OPTICS. Weka takes 11 hours and GNU R/fpc takes 100 minutes (DBSCAN, no OPTICS available). And the implementation of OPTICS in Weka is not even complete (it does not support proper cluster extraction from the plot). Many of the other OPTICS implementations you can find with Google (e.g. in Python or MATLAB) seem to be based on this Weka version ...

ELKI is open source. So if you want to peek at the code, here are direct links: DBSCAN.java, OPTICS.java.

Some part of the code may be a bit confusing at first. The "Parameterizer" classes serve the purpose of allowing automatic UI generation, for example. So there is quite a bit of meta code involved.

Plus, ELKI is quite extensively optimized. For example, it does not use Java Collections much anymore. Java Iterators, for example, require returning an object on next();. The C++ style iterators used by ELKI can have multiple values, and primitive values.

for(DBIDIter id = relation.iterDBIDs(); id.valid(); id.advance())

is a typical for loop in ELKI, iterating over all objects of a relation, yet the whole loop requires creating (and GC'ing) only a single object. And actually, this is as literal as a for loop can get.

ModifiableDBIDs processedIDs = DBIDUtil.newHashSet(size);

is another example. Essentially, this is like a HashSet<DBID>. Except that it is a lot faster, because the object IDs do not need to live as Java objects, but can internally be stored more efficiently (the only currently available implementation of the DBID layer uses primitive integers).
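The iterator style described above can be sketched in a few lines of plain Java. This is a hypothetical illustration over an int array, not ELKI's actual DBIDIter; only the valid()/advance() pattern is taken from the loop shown earlier, and the class name and the intValue() accessor are assumptions.

```java
// Illustrative sketch only; not ELKI's DBIDIter implementation.
class PrimitiveIterSketch {
  /** C++-style iterator: one object for the whole loop, primitive values out. */
  static final class IntArrayIter {
    private final int[] data;
    private int pos = 0;

    IntArrayIter(int[] data) {
      this.data = data;
    }

    /** True while the iterator points at an element. */
    boolean valid() {
      return pos < data.length;
    }

    /** Move to the next element; nothing is returned, so nothing is boxed. */
    void advance() {
      pos++;
    }

    /** Current value as a primitive int. */
    int intValue() {
      return data[pos];
    }
  }

  /** Sums all ids; the only allocation in the loop is the iterator itself. */
  static long sum(int[] ids) {
    long total = 0;
    for (IntArrayIter it = new IntArrayIter(ids); it.valid(); it.advance()) {
      total += it.intValue();
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(sum(new int[] { 3, 5, 7 }));
  }
}
```

Because advancing and reading are separate calls, the iterator could just as well expose several accessors (say, an id and a distance) without wrapping them in a result object.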

Java advocates always accuse you of premature optimization when you avoid creating objects for primitives. Yet, in all my benchmarking, I have consistently seen that the number of objects you allocate has a major impact. At least when it is inside a loop that is heavily used. Java collections with boxed primitives just eat a lot of memory, and the memory management overhead does often make a huge difference. Which is why libraries such as Trove (which ELKI uses a lot) exist. Because memory usage does make a difference.

(Avoiding boxing/unboxing systematically in ELKI yielded approximately a 4x speedup. But obviously, ELKI involves a lot of numerical computations.)
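As a small illustration of what "boxing" means here, the following sketch computes the same sum twice: once through a List<Integer>, once through an int[]. It is only meant to show where the extra objects come from; it is not a benchmark, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not a benchmark and not ELKI code.
class BoxingSketch {
  /** Stores the values as Integer objects (autoboxing on add, unboxing on read). */
  static long sumBoxed(int n) {
    List<Integer> values = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      // Autoboxing: values outside the small Integer cache become separate objects.
      values.add(i);
    }
    long total = 0;
    for (Integer v : values) {
      total += v; // unboxing on every access
    }
    return total;
  }

  /** Stores the values in a primitive array: one allocation, no boxing. */
  static long sumPrimitive(int n) {
    int[] values = new int[n];
    for (int i = 0; i < n; i++) {
      values[i] = i;
    }
    long total = 0;
    for (int v : values) {
      total += v;
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(sumBoxed(1000) + " " + sumPrimitive(1000));
  }
}
```

Both methods return the same result; the difference is only in how many objects the garbage collector has to deal with along the way.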

22 October 2012

Erich Schubert: Changing Gnome 3 colors

One thing that many people dislike about Gnome 3, in my opinion, is that the authors/maintainers impose a lot of decisions on you. They are in fact not really hard coded, but I found the documentation on how to change them to be really inaccessible.

For example colors. I found it extremely badly documented how to customize GTK colors. And at the same time, a lot of the themes do not work reliably across different Gnome versions. For example the unico engine in Debian experimental is currently incompatible with the main GTK version there (and even worse, GTK does not realize this and refuse to load the incompatible engine). A lot of the themes you can get on gnome-look.org for example use unico. So it's pretty easy to get stuck with a non-working GTK 3; this really should not happen that easily. (I do not blame the Debian maintainers for not having worked around this using package conflicts yet - it's in experimental after all. But upstream should know when they are breaking APIs!)

For my work on the ELKI data mining framework I do a lot of work in Eclipse. And here GTK3 really is annoying, in particular the default theme. Next to unusable, actually, as code documentation tooltips show up black-on-black.

Recently, Gnome seems to be mostly driven by a mix of design and visual motivation. Gnome shell is a good example. No classic Linux user I've met likes it; even my dad immediately asked me how to get the classic panels back. It is only the designers that seem to love it. I'm concerned that they are totally off on their audience; they seem to target the Mac OSX users instead of the Linux users. This is a pity, and probably much more a reason why Gnome so far does not succeed on the Desktop: it keeps on forgetting the users it already has. By now, they seem to be moving to XFCE and LXDE because neither the KDE nor the Gnome crowd care about classic Linux users in the hunt for copying OSX & Co.

Anyway, enough ranting. Here is a simple workaround -- that hopefully is more stable across GTK/Gnome versions than all those themes out there -- that just slightly adjusts the default theme:

$ gsettings set \
org.gnome.desktop.interface gtk-color-scheme '
os_chrome_fg_color:black;
os_chrome_bg_color:#FED;
os_chrome_selected_fg_color:black;
os_chrome_selected_bg_color:#f5a089;
selected_fg_color:white;
selected_bg_color:#c50039;
theme_selected_fg_color:black;
theme_selected_bg_color:#c50039;
tooltip_fg_color:black;
tooltip_bg_color:#FFC;
'

This will turn your panel from a designer-hip black back to a regular grayish work panel. If you are working a lot with Eclipse, you'll love the last two options. That part makes the tooltips readable again! Isn't that great? Instead of caring about what the latest hipster colors are, we now have readable tooltips for developers again instead of all those fancy-schmancy designer orgasms!

Alternatively, you can use dconf-editor to set and edit the value. The tricky part was to find out which variables to set. The (undocumented?) os_chrome stuff seems to be responsible for the panel. Feel free to change the colors to whatever you prefer!

GTK is quite customizable. And the gsettings mechanism actually is quite nice for this. It just seems to be really badly documented. The Adwaita theme in particular seems to have quite some hard-coded relationships also for the colors. And I haven't found a way (without doing a complete new theme) to just reduce padding, for example. In particular, as there probably are a hundred CSS parameters that one would need to override to get it into everywhere (and with the next Gnome, there will be again two dozen to add?)

The above method just seems to be the best way to tweak the looks. At least the colors, since that is all that you can do this way. If you want to customize more, you probably have to do a complete theme. At which point, you probably have to redo this at every new version. And to pick on Miguel de Icaza: the kernel APIs are extremely stable, in particular compared to the mess that Gnome has been across versions. And at every new iteration, they manage to offend a lot of their existing users (and end up looking more and more like Apple - maybe we should build more on what we are good at, instead of copying OSX and .NET?).

9 September 2012

Erich Schubert: Google Plus replacing blogs not Facebook

When Google launched Google+, a lot of people were very sceptical. Some outright claimed it to be useless. I must admit, it has a number of functions that really rock.

Google Plus is not a Facebook clone. It does not try to mimic Facebook that much. To me, it looks much more like a blog thing. A blog system, where everybody has to have a Google account, and then can comment (plus, you can then restrict access and share only with some people). It also encourages you to share shorter posts. Successful blogs always tried to make their posts "articles". Now the posts themselves are merely comments; but not as crazy short as Twitter (it is not a Twitter clone either), and it does have rich media contents, too.

Those who expect it to replace their Facebook where the interaction is all about personal stuff will be somewhat disappointed. Because IMHO it encourages the smalltalk type of interaction much less.

However, it won a couple of pretty high profile people to share their thoughts and web discoveries with the world. Some of the most active users I follow on Google Plus are Linus Torvalds and Tim O'Reilly (of the publishing house O'Reilly).

Of course I also have a number of friends that share private stuff on Google Plus. But in my opinion the strength of Google Plus is on sharing publicly. Since Google is the king of search, they can feed shares of your friends into your regular search results, and there is also a pretty interesting search in Google Plus. The key difference is that with this search, the focus is on what is new. Regular web search is also a lot about searching for old things (where you did not bother to remember the address or bookmark the site - and mind it, today a lot of people even "google for Google" ...) For example I like the plus search for data mining because it occasionally has some interesting links in it. A lot of the stuff is coming in again and again, but using the "j and k" keys, I can quickly scroll through these results to see if there is anything interesting. And there are quite a lot of interesting things I've discovered this way.

Note that this can change anytime. And maybe it is because I'm interested in technology stuff that it works well for me. But say, maybe you are more into HDR photography than me (I think they look unreal, as if someone has done way too much contrast and edge enhancing on the image). But go there, and press "j" a number of times to browse through some HDR shots. That is a pretty neat search function there. And if you come back tomorrow, there will likely be new images!

Facebook tried to clone this functionality. Google+ launched in June 2011, and in September 2011, Facebook added "subscribers". So they realized the need for having "non-friends" that are interested in what you are doing. Yet, I don't know anybody actually using it. And the Public posts search is much less interesting than that of Google Plus, and the nice keyboard navigation is also missing.

Don't get me wrong, Facebook still has its uses. When I travel, Facebook is great for me to get into contact with locals to go swing dancing. There are a number of events where people only invite you on Facebook (and that is one of the reasons why I've missed a number of events - because I don't use Facebook that much). But mind it, a lot of the stuff that people share on Facebook is also really boring.

And that will actually be the big challenge for Google: keeping the search results interesting. Once you have millions of people there sharing pictures of lolcats - will it still return good results? Or will just about every search give you more lolcats?

And of course, spam. The SEO crowd is just warming up in exploring the benefits of Google Plus. And there are quite some benefits to be gained from connecting web pages to Google Plus, as this will make your search results stick out somehow, or maybe give them that little extra edge over other results. But just like Facebook at some point was so heavily spammed when every little shop was setting up its Facebook pages, inviting everyone to all the events and so on - this is bound to happen on Google Plus, too. We'll see how Google then reacts, and how quickly and effectively.

2 September 2012

Erich Schubert: ELKI call for contributions

ELKI is a data mining software project that I have been working on for the last few years as part of my PhD research. It is open source (AGPL-3 licensed) and available as both a Debian package and Ubuntu package in the official repositories. So a simple aptitude install elki should get you going and give you a menu entry for ELKI. These packages come with the scripts elki to launch the MiniGUI and elki-cli to run from command line.

The key feature that sets ELKI apart from existing open source tools used in data mining (e.g. Weka and R) is that it has support for index structures to speed up algorithms, and a very modular architecture that allows various combinations of data types, distance functions, index structures and algorithms.

When looking for performance regressions and optimization potential in ELKI, I recently ran some benchmarks on a data set with 110250 images described by 8 dimensional color histograms. This is a decently sized dataset: it takes long enough (usually in the range of 1-10 minutes) to measure true hotspots. When including Weka and R in the comparison I was quite surprised: our k-means implementation runs at the same speed as R's implementation in C (and around twice as fast as the more flexible "flexclust" version). For some of the key algorithms (DBSCAN, OPTICS, LOF) we are an order of magnitude faster than Weka and R, and adding index support speeds up the computation by another factor of 5-10x. In the most extreme case - DBSCAN in Weka vs. DBSCAN with R-tree in ELKI - the speedup was a factor of 330x, or 2 minutes (ELKI) as opposed to 11 hours (Weka).

The reason why I was surprised is that I expected ELKI to perform much worse. It is written in Java (as opposed to R's kmeans, which is in C), uses a very flexible architecture which for example does not assume distances to be of type double and just has a lot of glue code in between. However, obviously, the Java Hotspot compiler actually lives up to expectations and manages to inline the whole distance computations into k-means, and then compiles it at a level comparable to C. R executes vectorized operations quite fast, but on non-native code as in the LOF example it can become quite slow, too. (I would not take Weka as reference, in particular with DBSCAN and OPTICS there seems to be something seriously broken. Judging from a quick look at it, the OPTICS implementation actually is not even complete, and both implementations actually copy all data out of Weka into a custom linear database, process it there, then feed back the result into Weka. They should just drop that "extension" altogether. The much newer and Weka-like LOF module is much more comparable.)

Note that we also have a different focus than Weka. Weka is really popular for machine learning, in particular for classification. In ELKI, we do not have a single classification algorithm because there is Weka for that. Instead, ELKI focuses on cluster analysis and outlier detection. And ELKI has a lot of algorithms in this domain, I dare to say the largest collection. In particular, they are all in the same framework, so they can be easily compared. R does of course have an impressive collection in CRAN, but in the end they do not really fit together.

Anyway, ELKI is a cool research project. It keeps on growing, we have a number of students writing extensions as part of their thesis. It has been extremely helpful for me in my own research, as I could quickly prototype some algorithms, then try different combinations and use my existing evaluation and benchmarking. You need some time to get started (largely because of the modular architecture, Java generics and such hurdles), but then it is a very powerful research tool.

But there are just many more algorithms, published sometime, somewhere, but rarely with source code available. We'd love to get all these published algorithms into ELKI, so researchers can try them out. And enhance them. And use them for their actual data. So far, ELKI was mostly used for algorithmic research, but it's starting to move out into the "real" world. More and more people that are not computer scientists start using ELKI to analyze their data. Because it has algorithms that no other tools have.

I tried to get ELKI into the "Google Summer of Code", but it was not accepted. But I'd really like to see it gain more traction outside the university world. There are a number of cool projects associated with ELKI that I will not be able to do myself in the next years, unfortunately.

A web browser frontend would be cool. Maybe even connected to Google Refine, using Refine for preprocessing the data, then migrating it into ELKI for analysis. The current visualization engine of ELKI is using SVG - this should be fairly easy to port into the web browser. Likely, the web browsers will even be faster than the current Apache Batik renderer.
Visual programming frontend. Weka, RapidMiner, Orange: they all have visual programming style UIs. This seems to work quite well to model the data flow within the analysis. I'd love to see this for ELKI, too.
Cluster/Cloud backend. ELKI can already handle fairly large data sets on a big enough system. If someone spends extra effort on the index structures, the data won't even need to fit into main memory anymore. Yet, everybody now wants "big data", and parallel computation probably is the future. I'm currently working on some first Hadoop YARN based experiments with ELKI. But this is a huge field, turning this into true "elk yarn". I will likely only lay some foundations (unless I get funding to continue as a PostDoc on this project. I sure hope to get to do at least a few years of postdoc somewhere, as I really enjoy working with students on this kind of project)
New visualization engine. The current visualization engine, based on Apache Batik and SVG, is quite okay. It does what I need, which is to get a quick glance at the results and the ability to export them for publications in print quality (in particular, I can easily edit the SVG files with Inkscape). But it is not really something fancy (although we have a lot of cool visualizations). And it is slow. I haven't found a portable and fast graphics toolkit for Java yet that can produce SVG files. There is a lot of hype around Processing, for example, but it seems to be too much about art for me. In fact, I'd love to use either something like Clutter or Cairo. But getting them to work for Windows and Mac OSX will likely be a pain.
Human Computer Interaction (HCI). This is in my opinion the biggest challenge we are facing with all the "big data" stuff. If you really go into big data (and not just run Hadoop and Mahout on a single system; yes - a lot of people seem to do this), you will at some point need to go beyond just crunching the numbers. So far, the challenges that we are tackling are largely data summarization and selection. TeraSort is a cool project, and a challenge. Yet, what do you actually get from sorting this large amount of data? What do you get from running k-means on a terabyte? When doing data mining on a small data set, you quickly learn that the main challenge actually is preprocessing the data and choosing parameters the right way so that your result is not bogus. Unless you are doing simple prediction tasks, you often don't have a clearly defined objective. Sure, when predicting churn rates, you can just throw all the data into a big cloud and hope you get some enlightenment out. But when you are doing cluster analysis or outlier detection - unsupervised methods - the actual objective by definition cannot be hardcoded into a computer. The key objective then is to learn something new on the data set. But if you want to have your user learn something on the data set, you will have to have the user guide the whole process, and you will have to present results to the user. Which gets immensely more difficult with larger data. Big data just no longer looks like this. And neither are the algorithms as simple as k-means or hierarchical clustering. Hierarchical clustering is good for teaching the basic ideas of clustering. But you will not be using a dendrogram for a huge data set. Plus, it has a naive complexity of O(n^3) and for some special cases O(n^2) - too slow for truly big data.
For the "big data future", once we get over all the excitement of being able to just somehow crunch these numbers, we will need to seriously look into what to do with the results (in particular, how to present them to the user), and how to make the algorithms accessible and usable for non-techies. Right now, you cannot expect a social sciences researcher to be able to use a Hadoop cluster. Let alone to make sense of the results. But if you are smart enough to actually solve this, and open up "big data processing" to the average non-IT user, this will be big.
Oh, and of course there are just hundreds of algorithms not yet available (accessible) as open source. Not in ELKI, and usually not anywhere else either. Just to name a few from my wishlist (I could probably implement many of them in a few hours in ELKI, but I don't have the time to do so myself, plus they are good student or starter projects to get used to ELKI): BIRCH, CLARA, CLARANS, CLINK, COBWEB, CURE, DOC, DUSC, EDSC, INSCY, MAFIA, P3C, SCHISM, STATPC, SURFING, ...

If you are a researcher in cluster analysis or outlier detection, consider contributing your algorithms to ELKI. Spend some time optimizing them, adding some documentation. Because, if ELKI keeps on growing and gaining popularity, it will be the future benchmark platform. And this can give you citations, which are somewhat the currency of science these days. Algorithms available in the major toolkits just do get cited more, because people compare to them. See this list for an overview of work cited by ELKI - scientific work that we reimplemented at least to some extent for ELKI. It is one of the services that we provide with ELKI for researchers: not only the algorithm, but also the appropriate citation.

30 August 2012

Erich Schubert: Finding packages for deinstallation

On my netbook, I try to keep the amount of installed software limited. Aptitude's "automatically installed" markers are very helpful here, since they allow you to differentiate between packages that were deliberately installed and packages that were only installed automatically to satisfy dependencies. I quite often browse through the list of installed packages and recheck those that are not marked as "A".

However, packages that are "suggested" by some other package (but not "required") will be kept even when marked as automatically installed. This is quite sensible: when you deinstall the package that "suggested" them, they will be removed. So this is nice for having optional software also automatically removed.

However sometimes you need the core package but not this optional functionality. Aptitude can help you there, too. Here's an aptitude filter I used to find some packages for removal:

!?reverse-depends(~i) ~M !?essential

It will display only packages with no direct dependency from another installed package and that are marked as automatically installed (so they must be kept installed because of a weaker dependency).

Some examples of "suggested but not required" packages:

Accessibility extensions of Gnome
Spelling dictionaries
Optional functionality / extensions

Depending on your requirements, you might want to keep some of these and remove others.

Here is also a filter to find packages that you can put on "automatically installed":

~i !~M ?reverse-depends(~i) !?essential

This will catch "installed but not automatically installed packages, that another installed package depends on". Note that you should not blindly put all of these to "automatic" mode. For example "logrotate" depends on one of "cron", "anacron" or "fcron". If you have both cron and anacron installed, aptitude will consider anacron to be unnecessary (it is - on a system with 24h uptime). So review this list, and see what happens when you set packages to "A", and reconsider your intentions. If it is software you want for sure, leave it on manual.

15 June 2012

Erich Schubert: Dritte Startbahn - 2 gewinnt!

Usually I don't post much about politics. And this even is a highly controversial issue. Please do not feel offended.

This weekend, there is an odd election in Munich. Outside of Munich, nearby the cities of Freising and Erding, there is the Munich airport. The company operating the airport is owned partially by the city of Munich, which gives the city a veto option.

The Munich airport has grown a lot. Everybody who has been flying a bit knows that big airports (such as Heathrow) are often the worst. If anything goes wrong, you are busted, because it will take them a long time to resume operations. This just happened to me in Munich, where the luggage system was down, and no luggage arrived at the airplanes.

Yet, they want to take the airport further down this road, and make it even bigger: add two satellite terminals, and a third runway. I'm convinced that this will make the airport much worse for anybody around here. The security checkpoints will be even more crowded, the lines for the baggage drop-off too, and you will have to walk much further on the airport.

Up to now, the Munich airport was pretty good compared to others. In particular given that it is one of the largest in Europe! It is because it had been designed from the ground up for this size. Now they plan to screw it up.

But there are other arguments against this, not just the egoistic view of a traveller. The first is the money issue. The airport is continuously making losses. It's the taxpayer that has to pay for all of this - and the current cost estimation is 1200 million. This is not appropriate, in particular since history shows that you can take this x2 to x10 to get the real number. They should first get the airport into a financially stable condition, then plan on making it even bigger.

Then there are the numbers. Just like with any large-scale political project, the numbers are all fake. The current airport was planned to cost 800 million, in the end it was about 8550 million. The politicians happily lie to us. Because they want to push their pet projects. We must no longer accept such projects based on fake numbers and old predictions.

If you are already one of the 10 largest airports in Europe, can you really expect to grow even further?!? There is a natural limit to growth, unless you want to have every single passenger in the world first travel to Munich multiple times, then go to his final destination ...

One thing they seem to have completely neglected is that Berlin is currently getting a brand new airport. And very likely, this is going to divert quite some traffic away from Munich. Just like the Munich airport diverted a lot of traffic away from Frankfurt. To some extent because many people actually want to go to Berlin, not Munich, but they currently have to change planes here or in Frankfurt. So when Berlin finally is operational, this will have an impact on Munich.

And speaking of the Berlin airport, this is a good example to not trust the numbers and our politicians. It is another example of a way-over-budget, way-behind-time project the politicians screwed up badly and where they lied to us. If we should not have trusted them with Berlin, why should we trust them with the Munich addon?

A lot of people whose families have been living there for years will have to be resettled. Whole towns are going to disappear. An airport is huge. Yet, they cannot vote against it, because their small towns do not own shares of the airport. The politicians don't even talk to them, not even to their political representatives.

Last but not least, the airport is in a sensitive ecological area. The directly affected area is a European special protection area for wild birds. There are nature preserves nearby, and all this area already suffers badly from airport drainage, pollution and noise. When they built the airport originally, the replacement areas they set up were badly done, and are mostly inhabited by nettles and goldenrod (which is not even native to Europe). See this article in Süddeutsche Zeitung on the impact on nature. You can't replace the loss of the original habitats just by digging some pools and filling them with water ...

If you want more information, go to this page, by Bund Naturschutz.

This is not about progress ("Fortschritt"). That is a killer argument the politicians love, but it doesn't hold. Nobody is trying to shut down the airport. Munich will be better off by keeping the balance, having both a reasonably sized airport (and in fact, the airport is already one of the 10 largest in Europe!) and preserving some nature to make it worth living here.

If you are located in Munich, please go vote against the airport extension, and enjoy the DEN-GER soccer game afterwards. Thank you.

9 June 2012

Erich Schubert: DMOZ dying

Sometime in the late 1990s I became a DMOZ editor for my local area. At that time, when the internet was a niche thing and I was still a kid, I was actually operating a web site that had a similar goal as the corresponding category for a non-profit organization.

In the following years, I would occasionally log in, try to review some pages. It was a really scary experience: it was still exactly the same, web 0.5 experience. You had a spreadsheet type of view, tons of buttons, and it would take like 10 page loads to just review a single site. A lot of the time, you would end up searching for a more appropriate category, copy the URL, replace some URL-encoded special characters, paste it in one out of 30 fields on the form just to move the suggested site to a more appropriate category. Most of the edits would be by bots that detected a dead link and disabled it by moving it to the review stage. While at the same time, every SEO manual said you need to be listed on DMOZ, so people would mass-submit all kinds of stuff to DMOZ in any category that it could in any way fit in.

Then AOL announced DMOZ 2.0. And everybody probably thought: about time to refresh the UI and make everything more usable. But it didn't. First of all, it came late (announced in 2008, actually delivered sometime in 2010), then it was incredibly buggy in the beginning. They re-launched 2.0 at least two times. For quite some time, editors would be unable to log in.

When DMOZ 2.0 came, my account was already "inactive", but I was able to get it re-activated. And it still looked the same. I read they changed from Times to Arial, and probably changed some CSS. But other than that, it was still as complicated to edit links as you could make it. So I did just a few changes then lost interest largely again.

During the last year I must have tried to give it another go multiple times. But my account had expired again, and I never got a reply to my reinstatement request.

A year ago, finally, Google Directory - the most prominent use of DMOZ/ODP data, although the users were totally unaware of it - was discontinued, too.

So by now, DMOZ seems to be as dead as it can get (they don't even bother to answer former contributors that want to get reinstated). The links are old, and if it weren't for bots to disable dead sites, it would probably look like an internet graveyard. But this poses an interesting question: will someone come up with a working "web 2.0 social" idea of the "directory" concept (I'm not talking about Digg and these classic "social bookmarking" dead ducks)? Something that strikes the right balance of on the one hand the web page admins (and the SEO gold diggers) being allowed to promote their sites (and keep the data accurate) and at the same time crowd-sourcing the quality control, while also opening the data? To some extent, Facebook and Google+ can do this, but they're largely walled gardens. And they don't have real social quality assurance; money is key there.
